Abstract


We propose a totally corrective boosting algorithm with explicit cardinality regularization. The
resulting combinatorial optimization problems are not known to be efficiently solvable with existing
classical methods, but emerging quantum optimization technology gives hope for achieving sparser
models in practice. In order to demonstrate the utility of our algorithm, we use a distributed classical
heuristic optimizer as a stand-in for quantum hardware. Even though this evaluation methodology
incurs large time and resource costs on classical computing machinery, it allows us to gauge the
potential gains in generalization performance and sparsity of the resulting boosted ensembles. Our
experimental results on public data sets commonly used for benchmarking of boosting algorithms
decidedly demonstrate the existence of such advantages. If actual quantum optimization were to
be used with this algorithm in the future, we would expect equivalent or superior results at much
smaller time and energy costs during training. Moreover, studying cardinality-penalized boosting
also sheds light on why unregularized boosting algorithms with early stopping often yield better
results than their counterparts with explicit convex regularization: early stopping performs
suboptimal cardinality regularization. The results that we present here indicate it is beneficial to
explicitly solve the combinatorial problem still left open at early termination.


Keywords: boosting, ensemble methods, sparsity, cardinality penalization, non-convex optimization,
quantum optimization

1. Introduction 

Boosting algorithms execute the following protocol in each iteration (Freund and Schapire, 1997;
Freund, 1998). The algorithm provides a distribution $u \in \mathbb{R}^m_{\geq 0}$ on a given set of $m$
training examples. Then an oracle provides a weak hypothesis from some hypothesis class and the
distribution is updated. At the end, the algorithm arrives at a linear combination
$w^* \in \mathbb{R}^n_{\geq 0}$ of weak hypotheses, where $n$ is the number of all possible weak hypotheses.
One can view boosting as a zero-sum game between a row and a column player (Warmuth et al., 2006).
Each possible hypothesis provided by the oracle is a column of an underlying game matrix
$\mathcal{H} \in \{-1, 1\}^{m \times n}$ that represents the entire hypothesis class available to the oracle.
The training examples correspond to rows $\mathcal{H}_{i:}$ of this matrix. $y_i \mathcal{H}_{i:} w$ is the
margin of example $i$ with label $y_i$ w.r.t. the linear combination $w$ of the weak hypotheses in
$\mathcal{H}$. From an optimization perspective, one can view different boosting algorithms as
column generation approaches (Demiriz et al., 2002) for solving regularized risk minimization:

$$\min_{w} \sum_{i=1}^{m} l(y_i \mathcal{H}_{i:} w) + \nu\,\Omega(w) \quad \text{s.t. } w \geq 0 , \tag{1}$$

where $l(\cdot)$ and $\Omega(\cdot)$ denote loss function and regularization, both typically assumed to be
convex. The number of columns in $\mathcal{H}$ is in principle unbounded, but it is finite for all finite
data sets in practice.
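As a concrete reading of the margins and of objective (1), the following minimal sketch evaluates both on a toy game matrix. It assumes exponential loss and $\Omega(w) = \mathbf{1}^\top w$; the function names and the toy data are illustrative and not part of the original text.

```python
import math

def margins(H, y, w):
    """Margin y_i * (H_i: w) of every training example.

    H is an m-by-n list of rows with entries in {-1, +1}, y holds labels in
    {-1, +1}, and w holds the n nonnegative hypothesis weights.
    """
    return [y_i * sum(h * w_j for h, w_j in zip(row, w))
            for row, y_i in zip(H, y)]

def regularized_risk(H, y, w, nu, loss=lambda g: math.exp(-g)):
    """Objective of eq. (1) with Omega(w) = 1^T w (assumed l1 penalty, w >= 0)."""
    assert all(w_j >= 0 for w_j in w), "eq. (1) requires w >= 0"
    return sum(loss(g) for g in margins(H, y, w)) + nu * sum(w)

# Toy game matrix: 3 examples (rows), 2 weak hypotheses (columns).
H = [[+1, -1],
     [+1, +1],
     [-1, +1]]
y = [+1, +1, -1]
w = [0.5, 0.0]   # only the first weak hypothesis is active
```

Here the first column agrees with every label, so all three margins equal the weight placed on it.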

Popular boosting algorithms commonly choose the $\ell_1$ penalty as regularization, and hence, we
expect that the minimizer $w^*$ is sparse. In fact, good boosting algorithms come with theoretical
proofs showing that a $w$ that is $\epsilon$-close to $w^*$ contains $O(1/\epsilon^2)$ non-zero entries
(Warmuth et al., 2008). At this point it is worthwhile to make a distinction between corrective and
totally corrective boosting algorithms. Corrective boosting algorithms such as AdaBoost only update
one coordinate of $w$ at every iteration. On the other hand, totally corrective algorithms such as
ERLPBoost (Warmuth et al., 2008) update all active coordinates at every iteration. Namely, in the
$t$-th iteration they solve a $t$-dimensional optimization problem. Experimental evidence suggests
that totally corrective algorithms produce significantly sparser solutions than corrective algorithms.
This leads to the central question of this paper: What if we explicitly enforced on $w$ a
sparsity-inducing cardinality penalization (CP), also known as the $\ell_0$ pseudo-norm (Bach et al.,
2011), in order to produce solutions with as few active weak hypotheses as possible? This requires
solving

$$\min_{w} \sum_{i=1}^{m} l(y_i \mathcal{H}_{i:} w) + \nu \mathbf{1}^\top w + \lambda\,\mathrm{card}(w) \quad \text{s.t. } w \geq 0 , \tag{2}$$

where $\mathrm{card}(w) = |\{i : w_i \neq 0\}|$ counts the number of nonzero elements in $w$, and the
$\ell_1$ term with a negligibly small but nonzero $\nu$ avoids ill-posed optimization problems.
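To make the difference between the two penalty terms in (2) tangible, here is a small sketch comparing two weight vectors with the same $\ell_1$ norm but different cardinality. The `tol` argument and the example vectors are assumptions added for illustration.

```python
def card(w, tol=0.0):
    """Number of nonzero entries of w; entries with |w_i| <= tol count as zero."""
    return sum(1 for w_i in w if abs(w_i) > tol)

def penalty(w, nu, lam):
    """Regularization part of eq. (2): nu * 1^T w + lam * card(w), for w >= 0."""
    return nu * sum(w) + lam * card(w)

dense = [0.25, 0.25, 0.25, 0.25]   # four active weak hypotheses
sparse = [1.0, 0.0, 0.0, 0.0]      # one active weak hypothesis
```

With `lam = 0` the two vectors are penalized identically, whereas any `lam > 0` charges the dense vector for each of its three additional active hypotheses.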

The motivation for our work is twofold. First, the emergence of commercial quantum optimization
technology offers hope that in the near future one will be able to directly solve problems involving
CP.$^1$ Second, there is a definite need to deploy high-accuracy classifiers in resource-constrained
environments ranging from mobile and wearable devices (see Subsection 5.2) to micro drones and
deep-space probes. Because latency, power usage, and processor cycles at prediction time are
critical in such applications, cardinality penalization in boosted ensembles can come with
disproportionate payoffs in terms of the practical results that suddenly become possible.

The paper is organized as follows. Section 2 reviews background. In Sect. 3 we first study
the impact of CP on boosting by deriving the Lagrange dual function and then state and discuss
the TotalQBoost algorithm. In Sects. 4, 5, and 6 we present the experimental setup, list results, and
conclude.

2. Background 

Boosting: Shen and Li (2010); Shen et al. (2013), inspired by the seminal work of Mason et al.
(2000), cast boosting as general regularized risk minimization over the space $\mathcal{H}$ of all possible
columns. Boosting algorithms can in principle incorporate all sensible loss functions and regularizers
provided the Lagrange duals of the corresponding primal problems can be derived and analyzed.
Lagrange duality is a useful tool in the theoretical treatment because the idea of column generation
(CG) is intimately tied to it. Demiriz et al. (2002) established the CG view of boosting and
showed how it can be used in convergence analysis and derivation of termination criteria for the
resulting practical algorithms. Moreover, when strong duality holds, the algorithm designer can opt
for putting the burden of optimization on the dual rather than the primal problem if practical benefits
are seen. The regularized risk minimization perspective emphasizes the similarity between boosting
and kernel methods. Boosting, via linear combinations of weak hypotheses from $\mathcal{H}$, also
effectively maps the original data features onto a high-dimensional space, thus enabling the
construction of non-linear decision surfaces in the original space.

1. For now we only experiment with a costly heuristic solver distributed over classical machines.

Cardinality penalization: As the ultimate sparsity promoter, CP has been considered many
times in the context of various machine learning problems. However, researchers have avoided 
dealing with it directly because it leads to combinatorial problems not known to be solvable in 
polynomial time by classical computing. Pilanci et al. (2012) propose a convex relaxation for CP and 
use it in recovering sparse probability measures over moment constraints and clustering via sparse 
Gaussian mixtures; Borwein and Luke (2010) propose a convex relaxation consisting of entropic 
regularization of the zero metric and study its convergence to sparse solutions; Zhang and Zhang 
(2012) survey existing statistical theory results on non-convex regularization and construct a general 
theoretical framework for them; Bach et al. (2011) present an extensive treatise on optimization tools 
and techniques dedicated to sparsity-inducing penalties. 

Quantum optimization: D-Wave Systems, Inc. develops quantum optimization hardware. The
D-Wave machines are not universal quantum computers as their specialized hardware is designed to 
take advantage of limited quantum effects at finite temperature on a restricted Ising model (Brush, 
1967). This is a departure from the idealistic adiabatic quantum algorithm of Farhi et al. (2001) but 
to date is the only practical design to ever achieve ~1000 functional qubits. Up to now there have 
been several studies, e.g. Lanting et al. (2014) and Boixo et al. (2014), thoroughly characterizing 
the underlying quantum effects that are regarded by the wider quantum computing community as 
necessary for achieving computational speedups over classical algorithms. Simultaneously, there 
are ongoing efforts to systematically characterize the actual speedup that the D-Wave machines 
can offer (Rønnow et al., 2014). There is also prior machine learning research using quantum
optimization: robust classification with a non-convex loss function that is compatible with the
D-Wave architecture (Denchev et al., 2012); a heuristic boosting algorithm with CP (Neven et al.,
2012); and training a detector for cars in digital images with a D-Wave optimizer (Neven et al.,
2009). In general, the potential advantage of adiabatic quantum optimization (AQO) over classical
algorithms is best understood by comparison with its closest classical analogue—simulated thermal 
annealing (Denchev, 2013). As illustrated in Fig. 1, AQO has at its disposal quantum tunneling, 
which is a fundamentally non-classical resource for escaping local minima in hard optimization 
problems. While thermal annealing may get trapped in a deep local minimum and not have enough 
thermal energy to jump over the tall potential barrier, quantum annealing has a better chance of 
reaching the global minimum with the help of tunneling. 

3. Totally Corrective Boosting with CP 

In Subsection 3.1 we first set the stage for cardinality-penalized boosting by analyzing the Lagrange
dual function. Surprisingly, we find that the dual is unaffected by CP in the primal. The fascinating
failure of CP to propagate its influence to the dual provides the verification that sample weights
given by the dual variables are valid, and CG can work correctly in the presence of CP. Borwein
and Luke (2010) similarly observe that going from primal to dual erases all information about the
behavior of CP. This counter-intuitive result requires investigation from different angles in order
to be fully understood. To that end, in Appendix A and B we provide two additional arguments
independently arriving at the same conclusion.

Figure 1: Reprinted from Smelyanskiy et al. (2012). A possible optimization landscape where
thermal annealing initially gets trapped in a local minimum. Its only escape route is over
the barrier via thermal fluctuations. In contrast, quantum annealing can escape the local
minimum via quantum tunneling.

Subsection 3.2 presents some more detailed background on CG in order to facilitate the
understanding of the main algorithm. In Subsection 3.3 we look at the duality gap created by the CP term
in the primal problem and gain useful insights for the design of the boosting algorithm. Finally, in 
Subsections 3.4 and 3.5 we relate early stopping to CP and discuss practical issues stemming from 
conversions between continuous and discrete variables necessitated by the architecture of quantum 
hardware. 

3.1. Lagrange Theory for Boosting with CP 

The primal problem for $\ell_1$- and cardinality-regularized risk minimization is

$$\min_{w,\gamma,\eta} \sum_{i=1}^{m} l(\gamma_i) + \nu \mathbf{1}^\top w + \lambda\,\mathrm{card}(\eta) \tag{3}$$
$$\text{s.t. } \gamma_i = y_i \mathcal{H}_{i:} w \ (\text{for } i = 1, \dots, m), \quad \eta = w, \quad w \geq 0 ,$$

where $l(\cdot)$ is a loss function; a column $\mathcal{H}_{:j}$ stores the $\{-1, 1\}$ responses of the $j$-th
weak classifier on all training examples; and dummy variables $\gamma$ and $\eta$ allow for arriving at a
meaningful dual. The Lagrangian, subject to $p \geq 0$, is

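$$L = \left(\nu \mathbf{1}^\top + \lambda s^\top - u^\top \mathrm{diag}(y)\,\mathcal{H} - p^\top\right) w + u^\top \gamma + \sum_{i=1}^{m} l(\gamma_i) - \lambda \left(s^\top \eta - \mathrm{card}(\eta)\right) .$$

To find the infimum over primal variables $w$ and $\gamma$, we set derivatives to zero and obtain
(the former after eliminating the multiplier $p \geq 0$):

$$u^\top \mathrm{diag}(y)\,\mathcal{H} \leq \nu \mathbf{1}^\top + \lambda s^\top \tag{4}$$

$$u_i = -l'(\gamma_i) . \tag{5}$$

Equation (5) means that a dual variable $u_i$ is exactly the negative derivative of the loss function $l$ at
margin $\gamma_i$. With (4) satisfied and $l^*(-u_i)$ and $\mathrm{card}^*(s)$ denoting the Fenchel conjugates
of $l(\gamma_i)$ and $\mathrm{card}(\eta)$, respectively,

$$\inf_{w,\gamma,\eta} L = -\sum_{i=1}^{m} l^*(-u_i) - \lambda\,\mathrm{card}^*(s) . \tag{6}$$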
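Equations (5) and (6) can be made concrete for exponential loss $l(\gamma) = e^{-\gamma}$, the loss used in the experiments of Sect. 5. The sketch below is an illustration under that assumption: the closed form $l^*(-u) = u \log u - u$ for $u > 0$ is derived here by setting the derivative of $-u\gamma - e^{-\gamma}$ to zero, and the function names are hypothetical.

```python
import math

def dual_weights(margins_):
    """u_i = -l'(gamma_i); for l(gamma) = exp(-gamma) this equals exp(-gamma_i)."""
    return [math.exp(-g) for g in margins_]

def conjugate_exp_loss(u):
    """l*(-u) = sup_gamma (-u * gamma - exp(-gamma)) for u >= 0.

    The supremum is attained at gamma = -log(u), giving u * log(u) - u for
    u > 0; at u = 0 the supremum is 0 (approached as gamma grows).
    """
    if u < 0:
        raise ValueError("the conjugate is infinite for u < 0")
    return 0.0 if u == 0 else u * math.log(u) - u

# Examples with small or negative margins receive larger sample weights.
weights = dual_weights([1.0, 0.0, -1.0])
```

A coarse grid search over $\gamma$ reproduces the closed form to within the grid resolution, which is a quick way to sanity-check the sign conventions.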

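Hence, the dual is

$$\min_{u,s} \sum_{i=1}^{m} l^*(-u_i) + \lambda\,\mathrm{card}^*(s) \quad \text{s.t. (4)} . \tag{7}$$

It appears $\mathrm{card}(\eta)$ from the primal impacts the dual via $\mathrm{card}^*(s)$ and $s$. However,
Lemma 1 shows that $\mathrm{card}^*(s)$ is finite only at $s = 0$, where it equals 0, so at any finite dual
optimum $\mathrm{card}^*(s) = 0$ and $s = 0$.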
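The behavior of $\mathrm{card}^*(s)$ can be probed numerically before reading the formal statement. This sketch evaluates $s^\top w - \mathrm{card}(w)$ along the ray $w = c\,s$; the choice of ray and the helper names are assumptions made for illustration, not a proof.

```python
def card(w):
    """Number of nonzero entries of w."""
    return sum(1 for w_i in w if w_i != 0)

def conj_objective(s, w):
    """s^T w - card(w), the quantity whose supremum over w defines card*(s)."""
    return sum(s_i * w_i for s_i, w_i in zip(s, w)) - card(w)

def along_ray(s, scales):
    """Evaluate the objective at w = c * s for each scale c."""
    return [conj_objective(s, [c * s_i for s_i in s]) for c in scales]

# For s != 0 the linear term eventually dominates the bounded card term.
growing = along_ray([0.1, 0.0], [1e2, 1e4, 1e6])
```

For $s = 0$ the objective reduces to $-\mathrm{card}(w) \leq 0$ and is maximized at $w = 0$, while for any $s \neq 0$ the values along the ray grow without bound.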

Lemma 1 The Fenchel conjugate of CP is

$$\mathrm{card}^*(s) = \sup_{w \in \mathbb{R}^n} s^\top w - \mathrm{card}(w) = \begin{cases} 0 & \text{for } s = 0 \\ \infty & \text{otherwise.} \end{cases}$$

Proof For $s = 0$, $\mathrm{card}^*(s) = \sup_w -\mathrm{card}(w) = 0$. For any $s \neq 0$, since
$\mathrm{card}(w) \leq n$,

$$\mathrm{card}^*(s) = \sup_w s^\top w - \mathrm{card}(w) \geq \sup_w s^\top w - n = \infty .$$

Equation (7) together with Lemma 1 prove that the dual problem is oblivious to CP:

Theorem 1 Equations (3) and (1) with $\Omega(w) = \mathbf{1}^\top w$ have identical Lagrange duals.

3.2. Column Generation

As discussed in Sect. 1, $\mathcal{H}$ is assumed to be finite but very large, and thus, it is impractical to
directly solve regularized risk minimization over $\mathcal{H}$ even when convexity holds. The role of CG,
which serves as the fundamental optimization framework for boosting algorithms, is to solve the
full optimization problem over $\mathcal{H}$ in incremental steps by always considering only a small subset of
active columns $H \subset \mathcal{H}$ and augmenting $H$ one column at a time. When convex duality holds, only
columns with nonzero weights in the optimal solution $w^*$ need to ever be explicitly considered as
part of $H$, so most columns in $\mathcal{H}$ can safely be ignored throughout the entire algorithm (Demiriz
et al., 2002).

Hence, at iteration $t$ a $t$-dimensional problem, known as restricted master problem (RMP), is
optimized. After solving an RMP, the algorithm generates the next RMP by finding a column from
$\mathcal{H}$ that violates a constraint in (4). When convex duality holds we are guaranteed that for all columns
that are already in $H$, solving the current RMP also satisfies the corresponding dual constraints in
(4). Termination occurs when no such violation by more than $\epsilon > 0$ can be found anymore:

$$\nexists\, j \ \text{such that} \ u^\top \mathrm{diag}(y)\,\mathcal{H}_{:j} > \nu + \epsilon , \tag{8}$$

where $\nu$ is the $\ell_1$ regularization coefficient.
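The pricing step behind criterion (8) can be sketched as follows: compute $u^\top \mathrm{diag}(y)\,\mathcal{H}_{:j}$ for every column and return the inactive column that exceeds $\nu + \epsilon$ by the largest amount. Scanning all columns explicitly is an assumption that only suits small dictionaries, and the function names are illustrative.

```python
def edges(H, y, u):
    """u^T diag(y) H_{:j} for every column j of the game matrix H."""
    n = len(H[0])
    return [sum(u_i * y_i * row[j] for u_i, y_i, row in zip(u, y, H))
            for j in range(n)]

def most_violated_column(H, y, u, nu, eps, active):
    """Index of the inactive column violating eq. (8) the most, or None.

    `active` is the set of column indices already in the restricted master
    problem; returning None signals termination by the criterion of eq. (8).
    """
    best_j, best_edge = None, nu + eps
    for j, e in enumerate(edges(H, y, u)):
        if j not in active and e > best_edge:
            best_j, best_edge = j, e
    return best_j

H = [[+1, -1],
     [+1, +1],
     [-1, +1]]
y = [+1, +1, -1]
u = [0.25, 0.25, 0.5]   # current sample weights (dual variables)
```

On this toy data the first column has edge 1.0 and the second has edge -0.5, so only the first can violate the constraint for small $\nu$.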

From this perspective, solving primal RMPs in successive iterations corresponds to solving
increasingly tightened relaxations of the full dual problem over $\mathcal{H}$: Violated dual constraints for
columns in $\mathcal{H} - H$ are brought into consideration and satisfied one at a time. Upon termination, $w^*$
implicitly gives zero weights to columns still in $\mathcal{H} - H$ because their dual constraints are satisfied
to within $\epsilon$. Optimality follows from the fact that at termination we have $\epsilon$-approximate primal and
dual feasibility with equal objective values on the full problem over $\mathcal{H}$.

3.3. TotalQBoost 

In this paper we insist on directly optimizing the primal cardinality-penalized RMP at each boosting
iteration. However, CP makes the problem non-convex and destroys strong duality. With that,
we do not have any guarantee that CG will converge to the global minimum of the cardinality-penalized
risk minimization problem even if all RMPs at successive boosting iterations are solved
to optimality. We also obtain a technical complication from the fact that solving a non-convex primal
RMP to optimality does not guarantee the satisfaction of all dual constraints (4) corresponding to
columns in $H$. Even so, CG in conjunction with solving individual RMPs to optimality enables us
to take steps and overcome local minima in the high-dimensional non-convex space of (3). Even if
we fail to reach the global minimum of (3), solving non-convex problems along the way is highly
non-trivial optimization work that results in better boosted ensembles than what is possible with
convex methods.

When strong duality does not hold, a quantity of interest is the duality gap, i.e. the difference 
between the optimal values of the primal and dual problems. Fortunately, due to the fact that the
dual does not change under CP, we can determine a bound on the duality gap solely on the basis of 
the impact of the CP term in the primal. This also gives insight into the design of the algorithm that 
follows. 

Theorem 2 The duality gap $\delta$ between primal (3) and dual (7) is $\delta \geq \lambda\,\mathrm{card}(w)$ with
equality holding for $\lambda$ such that

$$\arg\min_{w} \sum_{i=1}^{m} l(y_i \mathcal{H}_{i:} w) + \nu \mathbf{1}^\top w + \lambda\,\mathrm{card}(w) = \arg\min_{w} \sum_{i=1}^{m} l(y_i \mathcal{H}_{i:} w) + \nu \mathbf{1}^\top w . \tag{9}$$

Proof By Theorem 1, the dual of the primal with CP is the same as the dual of the primal without
CP. Moreover, because we are assuming a convex loss function, strong duality holds for the latter
primal problem and its duality gap is zero. Hence, if the optimal solution to the primal with CP
(left-hand side of (9)) gives the best empirical risk under $\ell_1$ regularization only (right-hand side of
(9)), then $\delta = \lambda\,\mathrm{card}(w)$ because this is the difference in objective values at the common minimum.
Alternatively, if $\lambda$ is so large that the optimal solution to the primal with CP no longer gives the best
empirical risk under $\ell_1$ regularization, then $\delta > \lambda\,\mathrm{card}(w)$. ■
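The two regimes of Theorem 2 can be illustrated by brute force on a tiny problem. The sketch below assumes exponential loss and restricts $w$ to a finite grid, so it only approximates the continuous problems; the gap is computed as the CP primal optimum minus the $\ell_1$ primal optimum, which the proof equates with the dual optimum. The data, grid, and function names are illustrative.

```python
import itertools
import math

def risk(H, y, w, nu):
    """l1-regularized exponential risk of weights w."""
    return sum(math.exp(-y_i * sum(h * w_j for h, w_j in zip(row, w)))
               for row, y_i in zip(H, y)) + nu * sum(w)

def card(w):
    return sum(1 for w_j in w if w_j != 0)

def argmin_on_grid(H, y, nu, lam, grid):
    """Exhaustive minimizer of risk + lam * card over grid^n (w >= 0)."""
    n = len(H[0])
    return min(itertools.product(grid, repeat=n),
               key=lambda w: risk(H, y, w, nu) + lam * card(w))

def duality_gap(H, y, nu, lam, grid):
    """Return (gap, CP minimizer, l1 minimizer), all restricted to the grid."""
    w_cp = argmin_on_grid(H, y, nu, lam, grid)
    w_l1 = argmin_on_grid(H, y, nu, 0.0, grid)
    gap = risk(H, y, w_cp, nu) + lam * card(w_cp) - risk(H, y, w_l1, nu)
    return gap, w_cp, w_l1

H = [[+1, -1], [+1, +1], [-1, +1]]
y = [+1, +1, -1]
grid = [0.5 * k for k in range(7)]   # candidate weights 0.0, 0.5, ..., 3.0
```

On this toy problem a weak penalty ($\lambda = 0.5$) leaves the minimizer where the $\ell_1$ problem puts it, so the gap equals $\lambda\,\mathrm{card}(w)$, while a strong penalty ($\lambda = 2.5$) moves the minimizer to $w = 0$ and the gap exceeds $\lambda\,\mathrm{card}(w)$.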


Figure 2, Left shows the case of weak CP in which the location of the global minimum is
the same as the minimum of the problem without CP. In this regime if the loss function $l(\cdot)$ is
strictly convex, the term $\lambda\,\mathrm{card}(w)$ does not have any impact on the output of the boosting algorithm
because the minimum of the $\ell_1$-regularized risk is unique. On the other hand, Fig. 2, Right illustrates
the case of strong CP, which causes the location of the global minimum to be no longer the same as
the minimum of the convex primal. Consequently, the corresponding dual point may be suboptimal
and infeasible, which allows for possible violations of dual constraints (4) corresponding to columns
in $H$ whose weights are zeroed out by CP. In other words, by forcing some weights in the current
RMP to zero, the CP term leaves the corresponding dual constraints just as violated as they were
before adding these columns to $H$. Since these dual constraints are left violated by the current
RMP, the weak learner oracle may try to offer the corresponding columns repeatedly in subsequent
iterations. However, CG still proceeds in a well-defined manner if we prevent the oracle from
offering for the next iterations columns that have already been added to $H$ but have been forced to
zero weights by the CP term when solving subsequent RMPs.

Figure 2: Stylized depictions of primal and dual functions for intuitive understanding. Left: Global
minimum of primal with CP corresponds to the best empirical risk under $\ell_1$ regularization
only (small $\lambda$). In this case the duality gap is $\lambda\,\mathrm{card}(w)$. Right: Global minimum of
primal with CP is different from the best empirical risk under $\ell_1$ regularization only (large
$\lambda$). The duality gap is greater than $\lambda\,\mathrm{card}(w)$.

In this way, because the CP term forces successive RMPs to carry over violated dual constraints,
the usual termination by dual constraint satisfaction of the $\ell_1$ regularizing condition (8) is
abandoned. The CP term may simply be so strong that it could prevent some violated dual constraints
from ever becoming satisfied. Even so, we can go on with CG iterations while blacklisting
generated columns until the oracle is no longer able to offer any previously unseen columns. If the weak
learner dictionary is of such a large size that termination does not occur even after many stagnant
iterations of RMP solutions sticking to an old subset of generated columns while rejecting recent
ones, then early stopping can be applied in order to quit solving primal RMPs that might have grown
too large by that point. This is stated as Algorithm 1: TotalQBoost.

3.4. Early Stopping 

Even without an explicit CP term in the primal, early stopping puts the final solution in a local
minimum of the cardinality-penalized primal for all CP coefficients $\lambda > 0$. This is so because
early stopping ensures all columns that could yet be generated are left with zero weights in the final
ensemble. However, any point in the primal space with at least one zero-weighted coordinate is a
local minimum for $\mathrm{card}(w)$.

Consider an unregularized CG algorithm such as AdaBoost, which is much simpler and in practice
has been known to give better generalization than the various studied boosting algorithms with
explicit convex regularization. While the lack of explicit regularization in AdaBoost is bound to
eventually cause overfitting as more and more columns are generated, early stopping acts as
cardinality penalization. Early stopping simply limits the maximum number of columns that are to be
generated by the algorithm. However, the cardinality penalization that early stopping provides is
suboptimal, as the first $T$ generated columns are unlikely to be the optimal set of $T$ columns for
minimizing empirical risk.

Algorithm 1 TotalQBoost

Require: Loss function $l(\gamma)$, $m$ training examples $\{(x, y)\}$, dictionary of weak classifiers $\{h(x)\}$,
regularization parameters $\nu$ and $\lambda$, tolerance $\epsilon$, maximum iterations $T$
Ensure: Boosted ensemble $(H, w)^*$

1: Initialize: $u_i = \frac{1}{m}$, $\forall i = 1, \dots, m$; empty ensemble $(H, w)$; $t = 0$
2: while $t < T$ do
3:   Optimize members of $\{h(x)\}$ with current $u$
4:   if $\{h(x)\} - H = \emptyset$ or eq. (8) holds then
5:     break
6:   else
7:     Select weak classifier by most violated constraint in eq. (4):
       $\hat{h}(x) = \arg\max_{h(x) \in \{h(x)\} - H} \sum_{i=1}^{m} u_i y_i h(x_i)$
8:     Add $\hat{h}(x)$ as a new active column: $H = \{H, \hat{h}(x)\}$
9:   end if
10:  Optimize eq. (3) with $\{(x, y)\}$, $H$, $\nu$, $\lambda$ over discrete variables $\dot{w}$
11:  Update $w = \dot{w}$
12:  Let $\tilde{H} = \{h_k(x)\}$ for $k$ such that $w_k > 0$
13:  Optimize $\ell_1$-regularized risk with $\{(x, y)\}$, $\tilde{H}$, $\nu$, and tolerance $\epsilon$ over continuous variables $\tilde{w}$
14:  Copy the elements of $\tilde{w}$ to corresponding elements of $w$
15:  Update $t = t + 1$ and $u_i = -l'(y_i H_{i:} w)$
16: end while
17: $(H, w)^* = (\{h_k(x)\}, \{w_k\})$ for $k$ such that $w_k > 0$
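The control flow of Algorithm 1 can be sketched end to end on a toy dictionary. This is a minimal illustration under loud assumptions, not the implementation used in the experiments: exhaustive search over a small weight grid stands in for the discrete (quantum or Tabu) optimizer, projected gradient descent stands in for L-BFGS-B, the loss is exponential, and the whole dictionary is given as explicit columns. All function names are hypothetical.

```python
import itertools
import math

def exp_risk(cols, y, w, nu):
    """Sum of exponential losses plus nu * 1^T w for the given columns."""
    total = nu * sum(w)
    for i in range(len(y)):
        total += math.exp(-y[i] * sum(c[i] * w_k for c, w_k in zip(cols, w)))
    return total

def solve_cp_rmp(cols, y, nu, lam, grid):
    """Stand-in discrete solver: exhaustive search over grid^len(cols)."""
    def objective(w):
        return exp_risk(cols, y, w, nu) + lam * sum(1 for w_k in w if w_k > 0)
    return list(min(itertools.product(grid, repeat=len(cols)), key=objective))

def refine_l1(cols, y, w0, nu, steps=1000, lr=0.05):
    """Stand-in for L-BFGS-B: projected gradient descent on the l1 risk."""
    w = list(w0)
    if not cols:
        return w
    for _ in range(steps):
        u = [math.exp(-y[i] * sum(c[i] * w_k for c, w_k in zip(cols, w)))
             for i in range(len(y))]
        grad = [nu - sum(u[i] * y[i] * c[i] for i in range(len(y))) for c in cols]
        w = [max(0.0, w_k - lr * g) for w_k, g in zip(w, grad)]
    return w

def total_q_boost(columns, y, nu, lam, eps, T, grid):
    """columns[j][i] is the {-1, +1} response of weak classifier j on example i."""
    m = len(y)
    u = [1.0 / m] * m
    generated, weights = [], []   # generated columns double as the blacklist
    for _ in range(T):
        unseen = [j for j in range(len(columns)) if j not in generated]
        if not unseen:
            break
        edge = lambda j: sum(u[i] * y[i] * columns[j][i] for i in range(m))
        j_best = max(unseen, key=edge)
        if edge(j_best) <= nu + eps:                      # eq. (8) holds
            break
        generated.append(j_best)
        cols = [columns[j] for j in generated]
        w = solve_cp_rmp(cols, y, nu, lam, grid)          # lines 10-11
        keep = [k for k, w_k in enumerate(w) if w_k > 0]  # line 12
        refined = refine_l1([cols[k] for k in keep], y,
                            [w[k] for k in keep], nu)     # line 13
        for k, w_k in zip(keep, refined):                 # line 14
            w[k] = w_k
        weights = w
        u = [math.exp(-y[i] * sum(c[i] * w_k for c, w_k in zip(cols, w)))
             for i in range(m)]                           # line 15
    return {j: w_k for j, w_k in zip(generated, weights) if w_k > 0}

columns = [[+1, +1, -1], [-1, +1, +1]]
y = [+1, +1, -1]
grid = [0.5 * k for k in range(7)]
```

Exhaustive search grows exponentially with the number of generated columns, so this sketch is only usable for a handful of columns; that blow-up is exactly the part the discrete optimizer is meant to absorb.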

3.5. Discrete Weight Variables 

TotalQBoost uses discrete weight variables $\dot{w}$ in line 10 of Algorithm 1 because the quantum
optimization hardware is engineered as an interconnected collection of physical qubits, each of which
is regarded as a binary variable at the computational level. Because the current D-Wave hardware
has just ~1000 functional qubits, each element of $\dot{w}$ is also restricted to use only a small number of
bits in a fixed-point representation that implements the discrete variables $\dot{w}$ via binary expansions
using the underlying qubits. A practical issue stemming from converting from continuous to
discrete variables is that one may lose the ability to represent weight configurations at or close enough
to the lowest attainable objective value on continuous variables. A possible fix for this is to keep
increasing the bit-depth and adjusting the range of elements of $\dot{w}$ until the best non-CP empirical
risk over continuous variables can be reached to within some tolerance. We can then run
cardinality-penalized discrete optimization. However, even though the necessary bit-depth is not expected to be
very high (Neven et al., 2012), it might still be possible for this to result in a total number of binary
variables that is too large to optimize effectively. In that case the number of iterations $T$ that
TotalQBoost can perform is limited by the size of our discrete optimization facilities. Assuming that the
variables in $\dot{w}$ have enough bit-depth for correctly selecting the best subset of columns, we still have
the chance to subsequently refine the weights of the selected columns to minimize $\ell_1$-regularized
empirical risk. To that end, as specified by lines 12 and 13 of Algorithm 1, for the CP-selected
columns we run with continuous weight variables $\tilde{w}$ an off-the-shelf convex optimization algorithm
such as L-BFGS-B (Morales and Nocedal, 2011).
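One way to picture the fixed-point representation is the sketch below, which decodes a group of binary variables into a weight and counts the binary variables an RMP needs. The uniform level spacing over `[0, w_max]` is an assumption; the text does not specify the exact encoding, and the function names are illustrative.

```python
def decode_weight(bits, w_max):
    """Map binary variables (most significant first) to a weight in [0, w_max].

    Assumed encoding: the bits form an unsigned integer level that is scaled
    so that all-ones corresponds to w_max.
    """
    depth = len(bits)
    level = sum(b << (depth - 1 - k) for k, b in enumerate(bits))
    return w_max * level / (2 ** depth - 1)

def num_binary_variables(num_columns, depth):
    """Binary variables needed to encode one fixed-point weight per column."""
    return num_columns * depth
```

With a bit-depth of 6 and 100 generated columns, the settings reported in Sect. 5, an RMP would involve 600 binary variables, which indicates how quickly the discrete problem approaches the ~1000-qubit scale mentioned above.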

4. Experimental Setup 

We compare two variants of convex CG and three variants of cardinality-penalized CG per data set. 

A. $\ell_1$-regularized CG (L1CG): This is the ordinary $\ell_1$-regularized CG. We perform multiple
runs to $\epsilon$-convergence with a variety of regularization coefficients $\nu$ and record the performance of
the converged ensembles.

B. Unregularized CG (UCG) with early stopping: This is also $\ell_1$-regularized CG but with
a negligibly small yet nonzero regularization coefficient $\nu$. Nonzero $\nu$ is needed for avoiding
ill-posed optimization problems in the case of separable data. Early stopping is applied when too many
columns are generated before reaching the termination criterion (8). We perform a single run up to
a maximum number of iterations $T$ and record the performance of all intermediate ensembles in
order to gauge the benefit of early stopping at different times.

C. Cardinality-penalized CG (CPCG): We apply TotalQBoost with a negligibly small but
nonzero $\ell_1$ regularization coefficient $\nu$ and a variety of CP coefficients $\lambda$ and record the performance
of the ensembles at termination. Depending on the structure of the dictionary of weak classifiers, the
weak learner may run out of useful columns before we reach the global minimum of the full primal
problem with CP over all possible columns (3). Thus, we do not have any guarantee of reaching the
global minimum of the full problem. On the other hand, if a pre-set maximum number of iterations
$T$ is reached, we apply early stopping.

D. Hot-started CPCG: Hot-starting is done for the purpose of possibly getting to a better local
minimum than the one reached by C. We first perform early-stopping UCG for $T'$ iterations, then
initialize CPCG with these $T'$ columns, and continue as per the TotalQBoost algorithm. Termination
likely results in a different local minimum than the one that C. produces for the same $\lambda$. Here, too,
we repeat for a variety of CP coefficients $\lambda$ and record performance after termination of each.

E. Subset selection: We take the columns generated in the first phase of D., solve the primal
RMP corresponding to (3) only once for all different $\lambda$ values, and record the performance of the
resulting ensembles. This experiment is computationally cheaper than C. and D. because we solve
only a single discrete optimization problem per $\lambda$ value. The purpose is to gauge the suboptimality
of ensembles produced by B. as well as the worthiness of the extra effort exerted by D.

TotalQBoost as well as experiments C., D., E. described above assume the existence of a discrete
optimization method that solves to optimality the combinatorial problems arising from including
direct CP in the learning problem. The main motivation for attempting to study a machine learning
algorithm giving rise to combinatorial problems lies in the recent advances in physical
implementations of scalable quantum computing, as introduced in Sect. 2. However, because a sufficiently large
quantum machine was not available to us at the time of writing, we resorted to a distributed heuristic
method attempting to solve classically the problems that would otherwise be delegated to quantum
hardware. Hence, it is important to note that a potential weakness of the experimental results
obtained in this manner is that we have no guarantee the cardinality-penalized optimization problems
are solved to optimality. Consequently, in the actual comparison between the five experiments,
the results ascribed to experiments C.-E. are probably not optimal. Nevertheless, any advantages
manifested in the results of C.-E. can certainly be taken as evidence for the potential performance
improvements to be delivered by cardinality-penalized totally corrective boosting whenever
quantum optimization hardware becomes available to the machine learning practitioner.

5. Results 

The classical heuristic method used as a stand-in for quantum optimization hardware is Multistart 
Tabu Search (Palubeckis, 2004). We implemented a specialized distributed version of it and ran 
all experiments on all data sets over the course of a week on a collection of 1,800 commodity- 
grade machines with 16 cores each. Even with such a large amount of computational power we do 
not have optimality guarantees when solving the cardinality-penalized optimization problems, but 
some reasonable amount of tuning is done to the Tabu algorithm in order to ensure a basic level of 
confidence in its solutions. 

The dictionary of weak classifiers is constructed as a collection of decision stumps, each taking
one of the original features in a given data set. In this work we focus on studying the effect of CP, so
the choice of loss function is inconsequential. Hence, in all experiments we use the most common
loss function for boosting, exponential loss.$^2$ The optimization tolerance $\epsilon$ used for L-BFGS-B and
termination of L1CG is 5 • 10“^. In order to avoid large discrete optimization problems exceeding
current optimization capabilities, in the event that another termination criterion is not met, we limit
the maximum number of iterations for all methods to $T = 100$. The bit-depth for fixed-point
discrete variables $\dot{w}$ is chosen as 6, and their range is tentatively adjusted in successive iterations of
TotalQBoost based on optimal values of the continuous variables $\tilde{w}$ seen in L-BFGS-B solutions.
The number of hot-starting columns $T'$ for experiments D. and E. is chosen on a per-data-set basis
as shown in Appendix C.

5.1. Results on Public Data Sets 

We perform the experiments listed in Sect. 4 on twelve public data sets commonly used for
benchmarking of boosting algorithms. Due to the expensive nature of our discrete optimization, we can
only afford one train-validation split of 80-20% per data set. Because of that we do not have
averaged test results, but we offer error rates produced on the validation set as indicators of
generalization. A compressed representation of the results is shown in Fig. 3. As a mock-up of how model
selection is usually done via extensive cross-validation, when one or more of experiments A.-B.
and C.-E. produce classifiers with the same cardinality for the same data set, we select a group
representative according to the principle of Pareto efficiency and draw the resulting Pareto frontiers
(Kung et al., 1975) respectively for L1CG/UCG and CP.
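The frontier selection can be sketched as a single sorted scan over (cardinality, validation error) pairs, where lower is better in both coordinates. This is a generic two-objective Pareto filter written for illustration; it is not necessarily the procedure of Kung et al. (1975), and the sample points in the usage note are made up.

```python
def pareto_frontier(points):
    """Keep the (cardinality, error) points not dominated by any other point.

    A point dominates another if it is no worse in both coordinates and
    strictly better in at least one (lower is better for both).
    """
    frontier = []
    best_error = float("inf")
    # Sorting by cardinality, then error, lets one pass track the best error
    # seen so far among points that are at least as sparse.
    for cardinality, error in sorted(set(points)):
        if error < best_error:
            frontier.append((cardinality, error))
            best_error = error
    return frontier
```

For hypothetical points `[(10, 5.0), (10, 4.0), (20, 4.5), (30, 3.0), (40, 3.0)]` the filter keeps `(10, 4.0)` and `(30, 3.0)`: the first is the group representative at cardinality 10, and the second improves the error at a higher cardinality.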

Table 1 shows a sampling of the most notable sparsity and generalization gains provided by CP.
We define sparsity gain as the improvement in sparsity given by a CP point on the Pareto frontier
relative to the sparsest L1CG/UCG point with comparable generalization performance. The results
show various sparsity gains between 2.5% and 67.65%. Whenever there is no L1CG/UCG point to
which to relate for sparsity gain, i.e. all L1CG/UCG points have worse generalization performance,
we quantify generalization gain as the improvement in generalization given by a CP point relative
to the best L1CG/UCG point across all seen cardinalities. The results show various generalization
gains between 0.95% and 25%.
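Both gains can be read as relative improvements in percent. The sketch below encodes that reading, which is an assumption since the text gives no explicit formula; the baseline values used in the usage note are hypothetical and are not reported in the source.

```python
def sparsity_gain(card_baseline, card_cp):
    """Relative reduction in cardinality, in percent (assumed definition)."""
    return 100.0 * (card_baseline - card_cp) / card_baseline

def generalization_gain(err_baseline, err_cp):
    """Relative reduction in validation error, in percent (assumed definition)."""
    return 100.0 * (err_baseline - err_cp) / err_baseline
```

Under this reading, a CP ensemble of cardinality 11 compared against a hypothetical baseline of cardinality 34 gives `round(sparsity_gain(34, 11), 2) == 67.65`, which matches the magnitude of the largest gain quoted above.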

2. Prior work considering D-Wave, e.g. Denchev et al. (2012), was concerned with hardware-compatible loss functions.
Here we abandon this requirement due to the recent development of the “Blackbox compiler,” first used by McGeoch
and Wang (2013).
Figure 3: Generalization results on twelve public data sets commonly used for benchmarking of
boosting algorithms. For visualization clarity the axes limits are adjusted to discard
uninteresting regions of severe under- or overfitting. Additionally, the Pareto frontiers of
L1CG/UCG- and CP-generated points are drawn to outline the model selection that can
be done at different cardinalities. We also tag specific points on the CP Pareto frontier
whose advantages over L1CG/UCG points are quantified in Table 1.
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Table 1: Values in each entry are: generalization error of CP-generated point lying on Pareto fron¬ 
tier (%), its cardinality, and its sparsity gain (%) relative to the sparsest LICG/UCG point 
of comparable generalization. The rows are denoted by IDs that correspond to tags in 
Fig. 3. On the data sets that have entries for *, CP also achieves generalization unattain¬ 
able by the LICG/UCG methods at any cardinalities up to 100. For such CP points we 
cannot compute well-defined sparsity gains, but we point out their generalization improvements
relative to the best LICG/UCG points across all seen cardinalities: banana 3.52%,
twonorm 9.8%, waveform 0.95%, flare solar 25%. For these and other points it is also possible
to compute relative generalization gains on comparable cardinalities, but we leave
that to visual inspection of Fig. 3. The data sets breast cancer, heart, and thyroid do not have
any CP points with sparsity gains; diabetes, german, and splice each have only one or two CP
points with sparsity gains. The rest of the data sets have numerous CP points with sparsity
gains, out of which we pick the top five by generalization error to list here.


ID | banana           | image            | ringnorm         | twonorm          | waveform
*  | 25.85, 33, N/A   | N/A              | N/A              | 3.10, 52, N/A    | 10.47, 42, N/A
1  | 26.79, 31, 18.42 | 2.75, 28, 61.64  | 3.26, 51, 31.08  | 3.67, 34, 20.93  | 10.67, 38, 28.30
2  | 26.89, 12, 67.57 | 3.25, 29, 17.14  | 3.60, 60, 16.67  | 3.81, 36, 16.28  | 10.78, 37, 30.19
3  | 27.17, 21, 40.00 | 3.50, 26, 23.53  | 3.67, 46, 35.21  | 3.87, 39, 2.50   | 10.98, 35, 20.45
4  | 27.26, 11, 67.65 | 3.75, 15, 55.88  | 3.74, 55, 20.29  | 3.94, 38, 2.56   | 11.28, 32, 20.00
5  | 27.36, 23, 32.35 | 4.50, 17, 5.56   | 3.87, 58, 3.33   | 4.01, 35, 10.26  | 11.38, 33, 2.94

ID | diabetes         | german           | splice           | flare solar
*  | N/A              | N/A              | N/A              | 20.95, 3, N/A
1  | 16.00, 14, 41.67 | 25.91, 2, 60.00  | 6.73, 13, 7.14   | N/A
2  | 18.67, 5, 16.67  | N/A              | N/A              | N/A

To compare the optimization success of experiment E. against B., we counted the number of
times E. is worse than B. at coinciding cardinalities and divided by the total number of times E. and
B. had coinciding cardinalities. This measurement yielded only 1.61% on empirical risk and 2.05%
on training error, which confirms the view of early stopping as simply unoptimized CP. We did the
same comparison for experiments D. and E. On empirical risk, D. lost to E. 23.81% of the time
and on training error 13.01% of the time. We attribute these numbers mainly to our inability at
present to fully and reliably optimize the generated CP problems.
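The counting procedure above can be sketched as follows. This is a minimal illustration under assumptions that the text does not fix: results are keyed by a hypothetical (data set, cardinality) pair, one value is kept per key, and a tie is not counted as a loss. The function name `loss_rate` and all sample values are made up for illustration.

```python
from typing import Dict, Tuple

Key = Tuple[str, int]  # (data set name, cardinality)


def loss_rate(metric_x: Dict[Key, float], metric_y: Dict[Key, float]) -> float:
    """Percentage of coinciding keys at which x is strictly worse (larger) than y."""
    shared = metric_x.keys() & metric_y.keys()
    if not shared:
        raise ValueError("the two experiments share no (data set, cardinality) pair")
    losses = sum(1 for key in shared if metric_x[key] > metric_y[key])
    return 100.0 * losses / len(shared)


# Hypothetical empirical-risk values for two experiments, not taken from the paper.
risk_e = {("banana", 10): 0.30, ("banana", 12): 0.28, ("image", 15): 0.05, ("image", 20): 0.04}
risk_b = {("banana", 10): 0.31, ("banana", 12): 0.27, ("image", 15): 0.05, ("heart", 5): 0.20}

# Three pairs coincide; E. is strictly worse on exactly one of them.
print(round(loss_rate(risk_e, risk_b), 2))  # 33.33
```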

5.2. Results on Project Glass 

Lastly, we saw a significant impact of CP also on boosted cascades for the eye gesture detector used 
in Google’s Project Glass. The wearable device imposes severe memory, energy, and processing 
restrictions, so the trained detector is required to be maximally compact at any desired detection 
accuracy. In comparisons with corrective and totally corrective versions of convex boosting with 
early stopping, we observed that CP optimizations on cascades of cardinality ~20 using training 
sets of ~10^ examples and ~10'^ features can reduce false detections by 15-45% at fixed recall. 
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6. Conclusion 


Historically, in comparisons between AdaBoost and explicitly regularized boosting algorithms, e.g. 
Duchi and Singer (2009), what we call here UCG with early stopping has been known to yield better 
results than LICG. While it has been somewhat of a mystery why a much simplified algorithm such 
as AdaBoost is performing well, in the context of cardinality-penalized boosting we concluded that 
early stopping is nothing but suboptimal cardinality regularization. Our experiments with explicitly 
cardinality-penalized CG indicate that there is room for significant improvement as we attempt to 
solve the combinatorial problems arising from CP. Because in this work we used only a classical 
heuristic algorithm in place of quantum hardware, it remains to be seen how much better results 
can be obtained when the truly intended optimization engine for this learning algorithm becomes 
widely available. 
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Appendix A. Primal CP and the Lagrange Dual: Indirect Argument 


We can show that the Lagrange dual function is unaffected by the cardinality term in the primal
without introducing the dummy primal variables rj used in Subsection 3.1: 

m 

miny^ Z( 7 j) -|- w + Acard(u;) s.t. 7 * = yiTii-w (fori = 1 ,... ,m),w h 0 (10) 

i=\ 

Let the Lagrangian without the term $\lambda\,\mathrm{card}(w)$ be denoted by $L(w,\gamma,u,p)$, so that the
Lagrangian of (10) is $L(w,\gamma,u,p) + \lambda\,\mathrm{card}(w)$. Then the dual functions of the primal
without $\lambda\,\mathrm{card}(w)$ and with it are denoted respectively by

$$F(u,p) = \inf_{w,\gamma} L(w,\gamma,u,p) \tag{11}$$

and

$$F_\lambda(u,p) = \inf_{w,\gamma} L(w,\gamma,u,p) + \lambda\,\mathrm{card}(w). \tag{12}$$

A straightforward observation is that $L(w,\gamma,u,p) \le L(w,\gamma,u,p) + \lambda\,\mathrm{card}(w)$, with equality
holding only for $w = 0$. We use this in the following:


Theorem 3 For any $(u,p)$ at which $F$ and $F_\lambda$ are finite, $F = F_\lambda$.
Proof The Lagrangian without $\lambda\,\mathrm{card}(w)$ is linear in $w$:


m 

L{w, 7 , u,p) = - it''^diag {y)!! -p~^^w + + , s.t. p A 0 . 

'-V-^ i=l 

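If the coefficient of $w$ in $L(w,\gamma,u,p)$ is nonzero, we have $F(u,p) = -\infty$ and $F_\lambda(u,p) = -\infty$.
If it is zero, we see that $w = 0$ minimizes both $L(w,\gamma,u,p)$ and $L(w,\gamma,u,p) + \lambda\,\mathrm{card}(w)$: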

$$F_\lambda(u,p) = \inf_{w,\gamma}\; 0^\top w + u^\top \gamma + \sum_{i=1}^{m} l(\gamma_i) + \lambda\,\mathrm{card}(w) = \inf_{\gamma}\; u^\top \gamma + \sum_{i=1}^{m} l(\gamma_i) = F(u,p).$$
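The two cases of the proof can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the vector `c` is a hypothetical stand-in for the coefficient of w in the Lagrangian, the terms that depend only on gamma are dropped because they do not involve w, and the small grid and the value of `lam` are arbitrary choices.

```python
from itertools import product
from typing import Sequence


def card(w: Sequence[float]) -> int:
    """Number of nonzero entries of w."""
    return sum(1 for x in w if x != 0)


def penalized_linear(c: Sequence[float], w: Sequence[float], lam: float) -> float:
    """The w-dependent part of the penalized Lagrangian: c^T w + lam * card(w)."""
    return sum(ci * wi for ci, wi in zip(c, w)) + lam * card(w)


lam = 0.5
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Case 1: zero coefficient. Only lam * card(w) remains, so w = 0 is the minimizer
# and the cardinality term contributes nothing to the infimum.
c_zero = (0.0, 0.0)
best = min(product(grid, repeat=2), key=lambda w: penalized_linear(c_zero, w, lam))
print(best, penalized_linear(c_zero, best, lam))  # (0.0, 0.0) 0.0

# Case 2: nonzero coefficient. Moving w along -c decreases the value without bound,
# while lam * card(w) stays bounded, so the infimum is minus infinity either way.
c_nonzero = (1.0, -3.0)
values = [penalized_linear(c_nonzero, (-t * 1.0, t * 3.0), lam) for t in (1, 10, 100)]
print(values)  # [-9.0, -99.0, -999.0]
```

In both cases the penalized and unpenalized infima over w coincide, which is the content of Theorem 3.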
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Appendix B. Primal CP and the Lagrange Dual: Intuitive Argument 

Intuitively, the cardinality term could potentially affect the dual function only through its gradient. 
However, because $\mathrm{card}(w)$ contains no gradient information, no influence from it can appear in
the dual. This is properly understood by studying CP within the realm of distribution theory and 
generalized functions (Lighthill, 1958) as the limit of a sequence: 

card(w) = lim Qq kiw) , (13) 

n 

r7 ^ 

(Iruil + q~^) 

i=l 

The functions $\Omega_{q,k}(w)$ for finite $q$ are differentiable everywhere. Moreover, with $k > 1$,
differentiability as understood within the theory of generalized functions is maintained for $q \to \infty$ as
well. According to the definition of CP, we naturally expect $\partial\,\mathrm{card}(w)/\partial w_i = 0$ for $w_i \neq 0$. At first
glance, points with $w_i = 0$ are problematic from the perspective of differentiability. However, in
the limit of $q \to \infty$, the derivative at $w_i = 0$ is still zero for all defined $k$:

Theorem 4 The gradient of $\mathrm{card}(w)$ is $\dfrac{\partial\,\mathrm{card}(w)}{\partial w} = \lim_{q \to \infty} \dfrac{\partial\,\Omega_{q,k}(w)}{\partial w} = 0$.

Proof The first equality follows from Definition 6 in (Lighthill, 1958). To prove the second equality,
the derivative of $\Omega_{q,k}(w)$ with respect to $w_i$ is

dDq^kiw)/dwi = q~^ (|mi| + sign{wi). (14) 

q-fc 

Hence, we need to show that lim - ;—^ = 0 , which is obviously true for Wi A 0. 

{\wi\ +q-l) 
q~^ 1 

For Wi = 0, k > 1 results in lim -r— = lim -r—;— =0. ■ 

q-!-oo qQ *-1 q->-oo qQ 


Appendix C. Hot-starting Columns for Experiments D. and E. 


Table 2: Numbers of hot-starting columns used in experiments D. and E. 


banana | b.cancer | diabetes | f.solar | german | heart | image | r.norm | splice | thyroid | t.norm | w.form
40     | 20       | 20       | 10      | 40     | 15    | 30    | 40     | 30     | 5       | 40     | 40
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